# Multimodal Instruction Understanding
## PixelReasoner RL V1
TIGER-Lab · License: Apache-2.0 · Image-to-Text · Transformers · English · Downloads: 112 · Likes: 3

PixelReasoner is a vision-language model based on Qwen2.5-VL-7B-Instruct, trained with curiosity-driven reinforcement learning and focused on image-text-to-text tasks.
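Most Transformers checkpoints in this listing expose the same image-text-to-text interface, so one loading pattern covers them. Below is a minimal inference sketch using recent Transformers; the repo id is inferred from the listing and may differ, and it assumes the checkpoint ships a Qwen2.5-VL-style processor and chat template.

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

# Hypothetical repo id inferred from the listing; check TIGER-Lab's page for the exact name.
MODEL_ID = "TIGER-Lab/PixelReasoner-RL-v1"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("example.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is happening in this image?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(out[0], skip_special_tokens=True))
```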
## Jedi 7B 1080p
xlangai · License: Apache-2.0 · Image-to-Text · Safetensors · English · Downloads: 239 · Likes: 2

A multimodal model built on Qwen2.5-VL-7B-Instruct, supporting joint processing of images and text and suited to vision-language tasks.
## Llama 4 Scout 17B 16E Instruct FP8 Dynamic
RedHatAI · License: Other · Image-to-Text · Multilingual · Downloads: 5,812 · Likes: 8

A multilingual Llama 4 instruction model with 17B active parameters across 16 experts, optimized with dynamic FP8 quantization to significantly reduce resource requirements.
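RedHatAI's FP8-dynamic checkpoints are prepared for serving with vLLM (FP8 weights with dynamic per-token activation scaling). A minimal serving sketch follows; the repo id is modeled on the listing name and the GPU count is an assumption, since a 16-expert MoE typically needs multiple FP8-capable GPUs even when quantized.

```python
from vllm import LLM, SamplingParams

# Hypothetical repo id modeled on the listing name; verify on the RedHatAI page.
llm = LLM(
    model="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",
    tensor_parallel_size=4,  # assumption: split the MoE across four GPUs
    max_model_len=8192,
)

messages = [{"role": "user", "content": "Summarize the trade-offs of FP8 inference."}]
out = llm.chat(messages, SamplingParams(temperature=0.2, max_tokens=256))
print(out[0].outputs[0].text)
```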
## Qwen2.5 VL 32B Instruct GGUF
DevQuasar · Image-to-Text · Downloads: 27.50k · Likes: 1

A GGUF conversion of Qwen2.5-VL-32B-Instruct, a 32B-parameter multimodal vision-language model that supports joint understanding and generation over images and text.
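GGUF builds run under llama.cpp and its bindings. A text-only sketch with llama-cpp-python is below; the file name and quantization level are illustrative, and image input additionally requires the matching multimodal projector (mmproj) file that some GGUF repos ship alongside the weights.

```python
from llama_cpp import Llama

# Illustrative file name; use the quantization you actually downloaded (e.g. Q4_K_M).
llm = Llama(
    model_path="Qwen2.5-VL-32B-Instruct.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # offload every layer to the GPU if VRAM allows
)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "In one sentence, what is a vision-language model?"}],
    max_tokens=64,
)
print(resp["choices"][0]["message"]["content"])
```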
## Qwen2.5 VL 32B Instruct W4A16 G128
leon-se · License: Apache-2.0 · Image-to-Text · Downloads: 16 · Likes: 2

A quantized build of Qwen2.5-VL-32B-Instruct, a 32B-parameter multimodal large language model supporting vision and language tasks and suited to complex multimodal interaction scenarios.
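The W4A16-G128 suffix conventionally denotes 4-bit weights, 16-bit activations, and a quantization group size of 128, a format recent vLLM can load directly (the scheme is read from the checkpoint config). A sketch with a hypothetical repo id and image URL:

```python
from vllm import LLM, SamplingParams

# Hypothetical repo id based on the listing name; vLLM detects the quantization scheme.
llm = LLM(model="leon-se/Qwen2.5-VL-32B-Instruct-W4A16-G128", max_model_len=4096)

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/diagram.png"}},
        {"type": "text", "text": "Describe this diagram."},
    ],
}]
out = llm.chat(messages, SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)
```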
## Qwen2 VL 7B Visual RFT LISA IoU Reward
Zery · License: Apache-2.0 · Image-to-Text · Safetensors · English · Downloads: 726 · Likes: 4

A vision-language model built on Qwen2-VL-7B-Instruct, supporting multimodal image and text input and suitable for a range of vision-language tasks.
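The "IoU Reward" in the name points to a reinforcement fine-tuning setup where the reward signal is the overlap between a predicted and a ground-truth bounding box. A minimal sketch of that reward (an illustration of the general idea, not the authors' code):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_reward(predicted_box, ground_truth_box):
    # Reward the policy in proportion to localization quality.
    return iou(predicted_box, ground_truth_box)

print(iou_reward((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```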
## Qwen 2 VL 7B OCR
Swapnik · License: Apache-2.0 · Image-to-Text · Transformers · English · Downloads: 103 · Likes: 1

A fine-tuned version of the Qwen2-VL-7B model, trained with Unsloth and Hugging Face's TRL library for a 2x training speedup.
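Inference for an OCR-tuned Qwen2-VL follows the stock Qwen2-VL pattern with a transcription prompt. The repo id below is a guess from the listing:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "Swapnik/Qwen2-VL-7B-OCR"  # hypothetical id inferred from the listing

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("scanned_page.png")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Transcribe all visible text."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
ids = model.generate(**inputs, max_new_tokens=512)
new_tokens = ids[0][inputs["input_ids"].shape[1]:]  # drop the echoed prompt
print(processor.decode(new_tokens, skip_special_tokens=True))
```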
## Llama 3.2 11B Vision OCR
Swapnik · License: Apache-2.0 · Image-to-Text · Transformers · English · Downloads: 80 · Likes: 1

A 4-bit quantized build of the Llama 3.2 11B Vision Instruct model, optimized with Unsloth for roughly 2x faster training.
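Unsloth-prepared 4-bit vision checkpoints load through its FastVisionModel wrapper. A sketch under the assumption that the repo id matches the listing; note the returned tokenizer doubles as the image processor in Unsloth's vision API:

```python
from PIL import Image
from unsloth import FastVisionModel

# Hypothetical repo id inferred from the listing.
model, tokenizer = FastVisionModel.from_pretrained(
    "Swapnik/Llama-3.2-11B-Vision-OCR",
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)  # switch kernels/adapters to inference mode

image = Image.open("invoice.png")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Extract the text from this document."},
]}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(image, prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```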
## Llama 3.2 11B Vision Electrical Components Instruct
ankitelastiq · License: MIT · Image-to-Text · English · Downloads: 22 · Likes: 1

A multimodal model combining vision and language, based on Llama 3.2 11B Vision Instruct and supporting image-to-text tasks.
## Qwen2.5 VL 7B Instruct 4bit
jarvisvasu · License: Apache-2.0 · Image-to-Text · Transformers · English · Downloads: 180 · Likes: 1

A multimodal model fine-tuned from Qwen2.5-VL-7B-Instruct, trained with the Unsloth acceleration framework and the TRL library for a 2x training speedup.
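Pre-quantized bitsandbytes repos carry their 4-bit config inside the checkpoint, so recent Transformers loads them directly; the same config also quantizes a full-precision checkpoint on the fly. A sketch with an assumed repo id:

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "jarvisvasu/Qwen2.5-VL-7B-Instruct-4bit"  # hypothetical id from the listing

# Needed only when quantizing a full-precision checkpoint at load time;
# repos that already ship bitsandbytes 4-bit weights embed this config.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, quantization_config=bnb, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
```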
## Pixtral Large Instruct 2411
nintwentydo · License: Other · Image-to-Text · Transformers · Multilingual · Downloads: 23 · Likes: 2

A multimodal instruction-tuned model based on Mistral AI's Pixtral Large, supporting image and text input with multilingual processing capabilities.
## Qwen2 VL 7B Instruct GGUF
gaianet · License: Apache-2.0 · Image-to-Text · English · Downloads: 102 · Likes: 2

A GGUF build of Qwen2-VL-7B-Instruct, a 7B-parameter multimodal model supporting image-text interaction tasks.
## Qwen2 VL 7B Instruct Onnx
pdufour · License: Apache-2.0 · Image-to-Text · Transformers · Downloads: 47 · Likes: 4

An ONNX export of a 7B-parameter vision-language model based on the Qwen2-VL architecture, supporting image understanding and instruction-following interaction.
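ONNX exports of decoder-style VLMs are usually split into several graphs (vision encoder, embedder, decoder) that must be run in sequence; the file name below is illustrative, and the repo's README is the authority on the actual graph files and run order. Inspecting a session's inputs is a reasonable first step:

```python
import onnxruntime as ort

# Illustrative file name; substitute the graph file shipped in the repo.
session = ort.InferenceSession(
    "Qwen2-VL-7B-Instruct-decoder.onnx",
    providers=["CPUExecutionProvider"],
)
for tensor in session.get_inputs():
    print(tensor.name, tensor.shape, tensor.type)
```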
## OpenVLA 7B Finetuned LIBERO-10
openvla · License: MIT · Image-to-Text · Transformers · English · Downloads: 1,779 · Likes: 2

A vision-language-action model obtained by fine-tuning OpenVLA 7B with LoRA on the LIBERO-10 dataset, suited to robotics applications.
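OpenVLA checkpoints follow the pattern in the OpenVLA README: the processor wraps a task instruction in a prompt, and a custom predict_action head (loaded via trust_remote_code) decodes a continuous robot action. The exact repo id and un-normalization key below are assumptions:

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "openvla/openvla-7b-finetuned-libero-10"  # hypothetical exact id

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")

image = Image.open("third_person_camera.png")
prompt = "In: What action should the robot take to pick up the black bowl?\nOut:"
inputs = processor(prompt, image).to("cuda", dtype=torch.bfloat16)

# predict_action decodes a 7-DoF action (xyz delta, rotation delta, gripper);
# unnorm_key selects the dataset statistics used to un-normalize it ("libero_10" assumed).
action = vla.predict_action(**inputs, unnorm_key="libero_10", do_sample=False)
print(action)
```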
## OpenVLA 7B Finetuned LIBERO-Goal
openvla · License: MIT · Image-to-Text · Transformers · English · Downloads: 746 · Likes: 1

An OpenVLA 7B vision-language-action model fine-tuned with LoRA on the LIBERO-Goal dataset, suited to robotics applications; usage mirrors the LIBERO-10 sketch above with a LIBERO-Goal un-normalization key.